The aim of the project is to predict whether it will rain tomorrow. Given
predictors such as temperature, dew point, …, the models will classify the
following day as “rainy” or not.
Install packages
# install.packages("tidymodels")
# install.packages("tidyverse")
# install.packages("xgboost")
# install.packages("ggplot2")
# install.packages("lubridate")
# install.packages("ranger")
# install.packages("vip")
# install.packages("DiagrammeR")
# install.packages("GGally")
Import the libraries used in the project
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(lubridate)
library(ranger)
library(vip)
library(xgboost)
library(DiagrammeR)
library(GGally)
library(rmarkdown)
The dataset used in this project is composed of hourly observations of weather-related features, available at https://huggingface.co/datasets/sayanroy058/Weather-Time-Series-Forecasting
data <- read_csv("dataset.csv")
A first visualization of the data
paged_table(head(data))
Check for missing values in the dataset
sum(is.na(data))
[1] 0
At first sight the snowfall column appears to be empty (all values
equal to 0); if that is the case, the column will be removed. This would
also imply that precipitation and rain carry the same information, so
precipitation will be removed as well.
# Check if the 'snowfall (cm)' column is all zeros and remove it if true
if (all(data$`snowfall (cm)` == 0)) {
data <- data %>% select(-`snowfall (cm)`)
}
# If 'precipitation' is equivalent to 'rain', remove 'precipitation'
if (all(data$`precipitation (mm)` == data$`rain (mm)`)) {
data <- data %>% select(-`precipitation (mm)`)
}
To proceed with the computation, the hourly observations can be grouped
into daily data. The mean is used for the “constrained” variables, such as
wind direction and the percentage features, while the amount of rain is
summed, since it is an unconstrained cumulative quantity.
data <- data %>%
mutate(time = as.Date(time)) %>%
group_by(time) %>%
summarise(across(-`rain (mm)`, mean), `rain (mm)` = sum(`rain (mm)`))
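Averaging wind direction with the arithmetic mean can be misleading near the 0/360 wrap-around; a circular mean is one alternative (a sketch, not applied in this project):

```r
# The arithmetic mean is wrong for angles near the 0/360 wrap:
# mean(c(350, 10)) gives 180, but the true average heading is 0.
# A circular mean averages the unit vectors instead.
circ_mean <- function(deg) {
  rad <- deg * pi / 180
  (atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi) %% 360
}

mean(c(350, 10))       # 180 -- misleading heading
circ_mean(c(350, 10))  # ~0  -- correct heading
```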
The new grouped data:
paged_table(head(data))
Since the data was reduced and grouped, check again for missing values
sum(is.na(data))
[1] 0
Now that the data is ready, the summary statistics can be printed
summary(data)
time temperature relative_humidity dew_point
Min. :1980-01-01 Min. :14.05 Min. :16.21 Min. :-6.704
1st Qu.:1991-02-09 1st Qu.:22.33 1st Qu.:47.58 1st Qu.:10.329
Median :2002-03-20 Median :23.68 Median :60.96 Median :15.613
Mean :2002-03-20 Mean :24.13 Mean :63.95 Mean :15.233
3rd Qu.:2013-04-28 3rd Qu.:25.96 3rd Qu.:85.38 3rd Qu.:21.046
Max. :2024-06-06 Max. :34.63 Max. :98.96 Max. :23.642
pressure_msl (hPa) surface_pressure (hPa) cloud_cover (%)
Min. : 994.4 Min. :930.3 Min. : 0.000
1st Qu.:1007.5 1st Qu.:942.8 1st Qu.: 2.708
Median :1010.5 Median :945.7 Median : 16.958
Mean :1010.4 Mean :945.3 Mean : 35.153
3rd Qu.:1013.6 3rd Qu.:948.0 3rd Qu.: 74.250
Max. :1021.2 Max. :954.0 Max. :100.000
cloud_cover_low (%) cloud_cover_mid (%) cloud_cover_high (%)
Min. : 0.000 Min. : 0.000 Min. : 0.00
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
Median : 3.417 Median : 8.375 Median : 10.17
Mean :24.285 Mean : 16.730 Mean : 30.08
3rd Qu.:53.667 3rd Qu.: 23.542 3rd Qu.: 60.67
Max. :99.750 Max. :100.000 Max. :100.00
vapour_pressure_deficit (kPa) wind_speed_10m (km/h) wind_direction
Min. :0.02833 Min. : 2.583 Min. : 15.25
1st Qu.:0.48083 1st Qu.: 7.162 1st Qu.:148.96
Median :1.27542 Median : 9.300 Median :236.54
Mean :1.28585 Mean :10.992 Mean :204.20
3rd Qu.:1.79458 3rd Qu.:14.117 3rd Qu.:258.54
Max. :4.68750 Max. :31.654 Max. :338.17
is_Day rain (mm)
Min. :0.4583 Min. : 0.000
1st Qu.:0.4583 1st Qu.: 0.000
Median :0.5000 Median : 0.000
Mean :0.5072 Mean : 2.747
3rd Qu.:0.5417 3rd Qu.: 2.000
Max. :0.5417 Max. :355.200
All the variables are numeric, but some of them are subject to
restrictions: wind_direction must lie between 0 and 360, and
the percentage variables must range between 0 and 100. These checks
can be done by looking at the min and max
values of each variable.
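Rather than eyeballing the summary, the range constraints can also be asserted programmatically; a minimal base-R sketch (the helper range_ok is hypothetical, not part of the project):

```r
# Returns TRUE when every value of x lies within [lo, hi]
range_ok <- function(x, lo, hi) all(x >= lo & x <= hi, na.rm = TRUE)

range_ok(c(15.25, 204.20, 338.17), 0, 360)  # wind_direction-style values: TRUE
range_ok(c(0, 35.2, 100), 0, 100)           # percentage values: TRUE
range_ok(c(-5, 120), 0, 100)                # out of range: FALSE
```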
Since the time variable was aggregated to daily values, is_Day
is no longer useful.
data <- data %>%
select(-is_Day)
paged_table(head(data))
The aim of the project is to predict whether it will rain tomorrow,
so a new response variable called willTomorrowRain is created:
it takes value 1 if any rain is recorded in
rain (mm) on the following day, and
0 otherwise
data <- data %>%
mutate(willTomorrowRain = ifelse(lead(`rain (mm)`) > 0, 1, 0))
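The labeling relies on dplyr::lead(), which pairs each day with the next day's rainfall and leaves NA on the last row. A base-R sketch of the same shift (lead1 is a stand-in for dplyr::lead):

```r
# Shift a vector forward by one position, padding the end with NA
lead1 <- function(x) c(x[-1], NA)

rain <- c(0, 2.5, 0, 7.1)        # rain (mm) on days 1..4
tomorrow <- lead1(rain)          # 2.5 0 7.1 NA
will_rain <- ifelse(tomorrow > 0, 1, 0)
will_rain                        # 1 0 1 NA -- the last day has no "tomorrow"
```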
Set the response variable as a factor
data$willTomorrowRain <- as.factor(data$willTomorrowRain)
data <- data %>%
filter(!is.na(willTomorrowRain))
paged_table(head(data))
Plot the response variable to see the distribution
data %>%
ggplot(aes(x = willTomorrowRain)) +
geom_bar()
Now plot the distribution of each feature to get more insight
data %>%
select(-time, -willTomorrowRain) %>%
gather(key = "feature", value = "value") %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~feature, scales = "free")
Another way to visualize the data is the use of boxplots.
data %>%
select(-time, -willTomorrowRain) %>%
gather(key = "feature", value = "value") %>%
ggplot(aes(x = feature, y = value)) +
geom_boxplot() +
coord_flip()
By looking at the plot it is evident that the variables have different scales and/or are skewed, so the data needs to be normalized.
data <- data %>%
select(-time) %>%
mutate_if(is.numeric, scale)
The procedure creates some NaN values, which will be set to
0
data[is.na(data)] <- 0
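These NaNs come from zero-variance columns: scale() subtracts the mean and divides by the standard deviation, and a constant column has sd 0, producing 0/0 = NaN. A minimal illustration:

```r
# A constant column scales to NaN: (5 - 5) / 0
scale(c(5, 5, 5))   # all NaN

# A non-constant column scales to z-scores
scale(c(1, 2, 3))   # -1 0 1
```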
Check for collinearity between predictors
data %>%
select(-willTomorrowRain) %>%
cor() %>%
corrplot::corrplot()
As expected, some of the variables are correlated, such as
relative_humidity and
vapour_pressure_deficit (kPa). A scatter plot gives
more insight
data %>%
ggplot(aes(x = relative_humidity, y = `vapour_pressure_deficit (kPa)`)) +
geom_point()
There is a correlation between those two variables, but it is not completely linear
data %>%
ggplot(aes(x = `pressure_msl (hPa)`, y = `surface_pressure (hPa)`)) +
geom_point()
These two variables, on the other hand, are strongly correlated, so only one of them needs to be kept
paged_table(data %>% select(-`pressure_msl (hPa)`))
It is also worth checking the cloud-related variables
data %>%
ggplot(aes(x = `cloud_cover_low (%)`, y = `cloud_cover_mid (%)`)) +
geom_point()
data %>%
ggplot(aes(x = `cloud_cover_mid (%)`, y = `cloud_cover_high (%)`)) +
geom_point()
data %>%
ggplot(aes(x = `cloud_cover (%)`, y = `cloud_cover_mid (%)`)) +
geom_point()
From these scatter plots there is no clear evidence of a linear correlation between the cloud variables.
Plot again the predictors after being normalized
data %>%
select(-willTomorrowRain) %>%
gather(key = "feature", value = "value") %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~feature, scales = "free")
paged_table(head(data))
Set the seed in order to allow reproducibility of the
project
set.seed(0)
Split the data into train and test set
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
train <- data[train_index, ]
test <- data[-train_index, ]
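Since the rows are ordered daily observations, a chronological hold-out is a common alternative to a random split for time series: a random split lets the model train on days that come after some test days. A sketch of a time-ordered split, not the split used in this project:

```r
# Hold out the most recent 20% of days as the test set
n <- 1000                     # stand-in for nrow(data)
cutoff <- floor(0.8 * n)
train_idx <- 1:cutoff         # oldest 80% of days
test_idx  <- (cutoff + 1):n   # newest 20% of days

length(train_idx)  # 800
length(test_idx)   # 200
```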
The first model implemented is the random forest model
tune_rf_spec <- rand_forest(
mtry = 13,
trees = 2000,
mode = "classification"
) %>%
set_engine("ranger", importance = "impurity")
# Create a recipe for pre-processing
rain_rf_recipe <- recipe(willTomorrowRain ~ ., data = train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
# Create a workflow to combine the model and the recipe
rf_workflow <- workflow() %>%
add_model(tune_rf_spec) %>%
add_recipe(rain_rf_recipe)
Fit the model to the training set
rf_fit <- fit(rf_workflow, data = train)
Evaluate performances
# Evaluate the model's performance on the training set
train_predictions <- rf_fit %>%
predict(train) %>%
bind_cols(train)
# Evaluate the model's performance on the test set
test_predictions <- rf_fit %>%
predict(test) %>%
bind_cols(test)
Calculate metrics
# Calculate metrics for training set
train_metrics <- train_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
# Calculate metrics for test set
test_metrics <- test_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
paged_table(train_metrics)
paged_table(test_metrics)
From this output a little bit of overfitting is evident, since the accuracy on the test set is lower than on the training set
# Variable importance
rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 13)
It may seem trivial that the most important variable is the
rainfall of the previous day, but that variable can also equal 0,
and combined with dew_point it leads to a good prediction. To
verify this, the same task can be performed without
rain (mm).
tune_rf_spec <- rand_forest(
mtry = 12,
trees = 2000,
mode = "classification"
) %>%
set_engine("ranger", importance = "impurity")
# Create a recipe for pre-processing
rain_rf_recipe <- recipe(willTomorrowRain ~ ., data = train) %>%
step_rm(`rain (mm)`) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
# Create a workflow to combine the model and the recipe
rf_workflow <- workflow() %>%
add_model(tune_rf_spec) %>%
add_recipe(rain_rf_recipe)
rf_fit <- fit(rf_workflow, data = train)
# Evaluate the model's performance on the training set
train_predictions <- rf_fit %>%
predict(train) %>%
bind_cols(train)
# Evaluate the model's performance on the test set
test_predictions <- rf_fit %>%
predict(test) %>%
bind_cols(test)
# Calculate metrics for training set
train_metrics <- train_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
# Calculate metrics for test set
test_metrics <- test_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
paged_table(train_metrics)
paged_table(test_metrics)
# Variable importance
rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 13)
The results are similar to the previous ones, and the most important
variable is now dew_point.
To put the accuracy in context, a naive persistence baseline is produced: it forecasts rain for tomorrow only if it rained today
train %>%
mutate(
willTomorrowRain = as.numeric(as.character(willTomorrowRain)),
predictedRain = ifelse(lag(willTomorrowRain, default = 0) == 0, 0, 1)
) %>%
summarise(
accuracy = sum(willTomorrowRain == predictedRain) / n()
)
# A tibble: 1 × 1
accuracy
<dbl>
1 0.517
The result is almost a random guess.
Since some overfitting was observed, a smaller forest can now be tried to compare the results
# Define the random forest model specification
tune_rf_spec <- rand_forest(
mtry = 13,
trees = 250,
mode = "classification"
) %>%
set_engine("ranger", importance = "impurity")
# Create a workflow to combine the model and the recipe
rf_workflow <- workflow() %>%
add_model(tune_rf_spec) %>%
add_recipe(rain_rf_recipe)
rf_fit <- fit(rf_workflow, data = train)
# Evaluate the model's performance on the training set
train_predictions <- rf_fit %>%
predict(train) %>%
bind_cols(train)
# Evaluate the model's performance on the test set
test_predictions <- rf_fit %>%
predict(test) %>%
bind_cols(test)
# Calculate metrics for training set
train_metrics <- train_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
# Calculate metrics for test set
test_metrics <- test_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
paged_table(train_metrics)
paged_table(test_metrics)
Variable importance
# Variable importance
rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 13)
The results are similar to the previous ones but slightly better.
Now a random search over the hyperparameter space can be performed to look for better configurations. A grid search could instead be used to find the global optimum in the search space, but it is computationally more expensive.
# Define the random forest model specification with tuning parameters
tune_rf_spec <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune(),
mode = "classification"
) %>%
set_engine("ranger", num.threads = 7)
# Create cross-validation folds
data_folds <- vfold_cv(train)
# Create a recipe for pre-processing
rain_rf_recipe <- recipe(willTomorrowRain ~ ., data = train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
# Create a workflow to combine the model and the recipe
rain_rf_wflow <- workflow() %>%
add_model(tune_rf_spec) %>%
add_recipe(rain_rf_recipe)
# Define the metrics to evaluate
metr_res <- metric_set(kap, accuracy)
# Perform random search tuning
rf_reg_tune <- rain_rf_wflow %>%
tune_grid(
data_folds,
grid = grid_random(
trees(range = c(100, 1000)),
mtry(range = c(8, 13)),
min_n(range = c(6, 8)),
size = 5
),
metrics = metr_res
)
# Collect and print the metrics
paged_table(rf_reg_tune %>% collect_metrics())
# Select the best configuration based on the "kap" metric
best_f1 <- rf_reg_tune %>% select_best(metric = "kap")
# Finalize the model with the best configuration
final_rf <- finalize_model(tune_rf_spec, best_f1)
# Print the best result configuration
print(best_f1)
# A tibble: 1 × 4
mtry trees min_n .config
<int> <int> <int> <chr>
1 9 626 7 Preprocessor1_Model5
# Output the final model
final_rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = 9
trees = 626
min_n = 7
Engine-Specific Arguments:
num.threads = 7
Computational engine: ranger
Now another model is implemented, the Extreme Gradient Boosting.
# Define the XGBoost model specification with tuning parameters
tune_btree_spec <- boost_tree(
learn_rate = tune(),
tree_depth = tune(),
trees = tune(),
mode = "classification"
) %>%
set_engine("xgboost", nthreads = parallel::detectCores())
# Create cross-validation folds
rain_folds <- vfold_cv(train)
# Create a recipe for preprocessing
rain_rf_recipe <- recipe(willTomorrowRain ~ ., data = train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
# Create a workflow to combine the model and the recipe
btree_wflow <- workflow() %>%
add_model(tune_btree_spec) %>%
add_recipe(rain_rf_recipe)
# Define the metrics to evaluate
metr_res <- metric_set(kap, accuracy)
# Perform random search tuning
btree_reg_tune <- btree_wflow %>%
tune_grid(
rain_folds,
grid = grid_random(
learn_rate(),
tree_depth(range = c(2, 8)),
trees(range = c(5, 20)),
size = 10
),
metrics = metr_res
)
# Collect and print the metrics
paged_table(btree_reg_tune %>% collect_metrics())
# Select the best configuration based on the "kap" metric
best_btree_f1 <- btree_reg_tune %>% select_best(metric = "kap")
# Finalize the model with the best configuration
final_btree <- finalize_model(tune_btree_spec, best_btree_f1)
# Print the best result configuration
print(best_btree_f1)
# A tibble: 1 × 4
trees tree_depth learn_rate .config
<int> <int> <dbl> <chr>
1 15 4 0.0118 Preprocessor1_Model08
# Output the final model
final_btree
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 15
tree_depth = 4
learn_rate = 0.0118445504772027
Engine-Specific Arguments:
nthreads = parallel::detectCores()
Computational engine: xgboost
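The tuned learn_rate (about 0.012) combined with only 15 trees is a very conservative configuration: in gradient boosting each tree's contribution is multiplied by the learning rate, so a small rate needs many trees to move the prediction away from the initial guess. A toy base-R sketch with depth-1 "stumps" on squared loss illustrates the shrinkage effect (an illustration of the idea, not xgboost's actual algorithm):

```r
set.seed(1)
x <- runif(200)
y <- as.numeric(x > 0.5) + rnorm(200, sd = 0.1)

# Fit the best single split on x minimizing squared error of residuals r
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- r[x <= s]; right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse)
      best <- list(sse = sse, split = s,
                   left = mean(left), right = mean(right))
  }
  best
}

# Boost: repeatedly fit a stump to the residuals and add a shrunken update
boost <- function(x, y, trees, learn_rate) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(trees)) {
    st <- fit_stump(x, y - pred)
    pred <- pred + learn_rate *
      ifelse(x <= st$split, st$left, st$right)
  }
  mean((y - pred)^2)  # training MSE
}

boost(x, y, trees = 15, learn_rate = 0.9)   # fits well
boost(x, y, trees = 15, learn_rate = 0.01)  # barely moves -- underfits
```

With only 15 trees, a learning rate near 0.01 leaves the prediction close to the initial mean, which is consistent with the modest accuracy of the tuned boosted model below.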
Calculate the accuracy
# Define the formula for the model
formula <- willTomorrowRain ~ .
# Fit the model to the training data using the formula
final_btree_fit <- fit(final_btree, formula = formula, data = train)
[14:06:28] WARNING: src/learner.cc:767:
Parameters: { "nthreads" } are not used.
# Evaluate the model's performance on the training set
train_predictions <- final_btree_fit %>%
predict(train) %>%
bind_cols(train)
# Evaluate the model's performance on the test set
test_predictions <- final_btree_fit %>%
predict(test) %>%
bind_cols(test)
# Calculate metrics for training set
train_metrics <- train_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
# Calculate metrics for test set
test_metrics <- test_predictions %>%
metrics(truth = willTomorrowRain, estimate = .pred_class)
Print the train metrics
paged_table(train_metrics)
Print the test metrics
paged_table(test_metrics)
# Assuming final_btree is the finalized model specification
# Create a new workflow with the finalized model
final_btree_wflow <- workflow() %>%
add_model(final_btree) %>%
add_recipe(rain_rf_recipe)
# Fit the workflow to the training data
final_btree_fit <- fit(final_btree_wflow, data = train)
[14:06:28] WARNING: src/learner.cc:767:
Parameters: { "nthreads" } are not used.
# Plot the variable importance
final_btree_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 13)
The first tree can be plotted to better visualize the splitting performed
xgb_model <- final_btree_fit %>%
extract_fit_parsnip() %>%
.$fit
# Visualize the first tree in the model
xgb.plot.tree(model = xgb_model, trees = 0)
This project evaluated the effectiveness of Random Forest and Extreme Gradient Boosting (XGBoost) in the classification of weather data to predict the likelihood of rain on the following day. The dataset, originally structured as a time series, was preprocessed to be treated as independent observations for modeling purposes.
The results demonstrated that Random Forest and XGBoost performed similarly in this particular case. Both models highlighted the significance of the amount of rainfall on the current day as the primary predictor of whether it would rain the next day. The presence or absence of rainfall was a critical factor in determining whether the subsequent day would also be classified as rainy. However, one notable limitation of this approach is that rain occurring overnight (for example, during a storm that spans midnight) may result in the following day being classified as rainy, even if precipitation occurred only briefly after midnight. To account for this limitation, the models were also trained without that predictor, which led to similar results.
The second most important variable identified by both models was the dew_point, indicating the relevance of the relationship between humidity and temperature in predicting rainfall. This finding suggests that atmospheric moisture, as captured by the dew point, plays a critical role in determining rain conditions for the next day.